Scattering vs. discrete cosine transform features in visual speech processing
Abstract
Appearance-based feature extraction constitutes the dominant approach for visual speech representation in a variety of problems, such as automatic speechreading and visual speech detection. To obtain the necessary visual features, a rectangular region-of-interest (ROI) containing the speaker’s mouth is typically extracted first, followed, most commonly, by a discrete cosine transform (DCT) of the ROI pixel values and a feature selection step. Although algorithmically simple and computationally efficient, this approach suffers from the DCT’s lack of invariance to typical ROI deformations, which stem primarily from variability in the speaker’s head pose and from small tracking inaccuracies. To address the problem, this paper investigates the recently introduced scattering transform as an alternative to the DCT for ROI representation within the appearance-based framework, suitable for visual speech applications. Several such tasks are considered, namely visual-only speech activity detection, visual-only and audio-visual sub-phonetic classification, and audio-visual speech synchrony detection, all employing deep neural network classifiers with either DCT or scattering-based visual features. Comparative experiments with the resulting systems are conducted on a large audio-visual corpus of frontal face videos, demonstrating, in all cases, the superiority of the scattering transform over the DCT.
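The DCT baseline described in the abstract (mouth ROI, 2D DCT of the pixel values, feature selection) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names `dct2_matrix` and `dct_roi_features`, the 32x32 ROI size, the number of retained coefficients, and the low-frequency-first selection rule are all assumptions made for the example; the abstract does not specify them.

```python
import numpy as np


def dct2_matrix(n: int) -> np.ndarray:
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)  # frequency index (rows)
    i = np.arange(n).reshape(1, -1)  # sample index (columns)
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] *= np.sqrt(1.0 / n)
    basis[1:, :] *= np.sqrt(2.0 / n)
    return basis


def dct_roi_features(roi: np.ndarray, num_coeffs: int = 100) -> np.ndarray:
    """Apply a 2D DCT to a grayscale ROI and keep low-frequency coefficients.

    Assumption: coefficients are ordered by increasing (row + column)
    frequency index, a common heuristic. The selection step used in the
    paper may differ.
    """
    rows, cols = roi.shape
    # Separable 2D DCT: transform the columns, then the rows.
    coeffs = dct2_matrix(rows) @ roi.astype(np.float64) @ dct2_matrix(cols).T
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    order = np.argsort((r + c).ravel(), kind="stable")
    return coeffs.ravel()[order[:num_coeffs]]


# Hypothetical usage: a random 32x32 array stands in for a tracked mouth ROI.
rng = np.random.default_rng(0)
roi = rng.random((32, 32))
features = dct_roi_features(roi, num_coeffs=100)

# A two-pixel circular shift stands in for a small tracking inaccuracy.
shifted_features = dct_roi_features(np.roll(roi, 2, axis=1), num_coeffs=100)
feature_change = np.linalg.norm(features - shifted_features)
```

Comparing `features` with `shifted_features` gives a rough sense of how the retained coefficients change under a small ROI displacement, which is the sensitivity the abstract attributes to the DCT.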
Similar resources
Audio-Visual Speech Processing System for Polish with Dynamic Bayesian Network Models
In this paper we describe a speech processing system for Polish which utilizes both acoustic and visual features and is based on Dynamic Bayesian Network (DBN) models. Visual modality extracts information from speaker lip movements and is based alternatively on raw pixels and discrete cosine transform (DCT) or Active Appearance Model (AAM) features. Acoustic modality is enhanced by using two pa...
Variability Analysis of Discrete Cosine Transform Coefficient (DCTC) Features for Speech Processing
Bingjun Dai, Old Dominion University, 1998. Director: Dr. Stephen A. Zahorian. In this research, the variability of Discrete Cosine Transform Coefficient (DCTC) features was investigated. Additionally, a new pitch-synchronous processing method was explored to increase the stability of features and t...
Speaker-Independent Visual Lip Activity Detection for Human-Computer Interaction
Recently there has been increased interest in using visual features for improved speech processing. Lip reading plays a vital role in visual speech processing. In this paper, a new approach for lip reading is presented. Visual speech recognition is applied in mobile phone applications, in human-computer interaction, and to recognize the spoken words of hearing-impaired persons. The visual spe...
A Comparison of Visual Features for Audio-Visual Automatic Speech Recognition
The use of visual information from the speaker’s mouth region has been shown to improve the performance of Automatic Speech Recognition (ASR) systems. This is particularly useful in the presence of noise, which even in moderate form severely degrades the recognition performance of systems using only audio information. Various sets of features extracted from the speaker’s mouth region have been used to im...
A system for audio-visual speech recognition
In this work, a system for audio-visual speech recognition is presented. A new hybrid visual feature combination, suitable for audio-visual speech recognition, was implemented. The features comprise both the shape and the appearance of the lips; dimensionality reduction is applied using the discrete cosine transform (DCT). A large visual speech database of the German language has been ass...